Data-Intensive Applications on HPC Using Hadoop, Spark and RADICAL-Cybertools

Shantenu Jha and Andre Luckow

The tutorial material is available as IPython notebooks on GitHub:

Requirements and Setup:

For this tutorial we set up a Hadoop cluster and an IPython Notebook environment on Amazon Web Services (no longer active after the tutorial):

Below is a list of dependencies for installation on other machines:

  • IPython
  • NumPy
  • pandas
  • scikit-learn
  • Matplotlib, Seaborn

We recommend using the Anaconda Python distribution.

1. Hadoop and Spark Introduction

We begin with an overview of using Hadoop and Spark:

Hadoop MapReduce: Link to Notebook
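
As a first orientation, the sketch below shows the classic word-count example written for Hadoop Streaming, which lets MapReduce jobs be expressed as plain Python scripts that read from stdin and write to stdout. File names and paths are illustrative.

```python
#!/usr/bin/env python
# mapper.py -- word-count mapper for Hadoop Streaming:
# emit "<word>\t1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming sorts the mapper output by key,
# so counts per word can be accumulated in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)

if current_word is not None:
    print("%s\t%d" % (current_word, count))
```

The job is then submitted with the Hadoop Streaming JAR, along the lines of `hadoop jar hadoop-streaming.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`; the JAR's location depends on the installation.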

Spark: Link to Notebook
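
For comparison, here is a minimal PySpark version of the same word count using the RDD API; the application name and input path are placeholders.

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")  # app name is arbitrary

# The same word count expressed as RDD transformations:
# split lines into words, pair each word with 1, sum per key.
counts = (sc.textFile("hdfs:///tutorial/input")   # placeholder path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(10))   # inspect the first few (word, count) pairs

sc.stop()
```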

2. Pilot-Abstraction for Distributed HPC and the Apache Hadoop Big Data Stack (ABDS)

The Pilot-Abstraction has been used to execute task-based workloads on distributed resources. A Pilot-Job is a placeholder job that is submitted to the resource management system and serves as a container for a dynamically determined set of compute tasks. The Pilot-Data abstraction extends the Pilot-Abstraction to support the management of data in conjunction with compute tasks.

The Pilot-Abstraction supports heterogeneous infrastructures, including cloud, HPC, and Hadoop resources.

The following example demonstrates how the Pilot-Abstraction is used to manage a set of compute tasks.

Link to Notebook
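
For readers without access to the notebook, the sketch below illustrates the pattern with RADICAL-Pilot, one implementation of the Pilot-Abstraction: a pilot is submitted as a placeholder job, and compute units are then scheduled into it. The resource label, core count, and exact API surface vary across versions, so treat this as illustrative rather than definitive.

```python
import radical.pilot as rp

session = rp.Session()

# Submit a pilot -- a placeholder job that acquires resources
# on our behalf (here: 4 cores on the local machine for 10 min).
pmgr  = rp.PilotManager(session=session)
pdesc = rp.ComputePilotDescription()
pdesc.resource = "local.localhost"   # placeholder resource label
pdesc.cores    = 4
pdesc.runtime  = 10                  # minutes
pilot = pmgr.submit_pilots(pdesc)

# Schedule a dynamically determined set of compute tasks
# (compute units) into the pilot's resources.
umgr = rp.UnitManager(session=session)
umgr.add_pilots(pilot)

cuds = []
for i in range(8):
    cud = rp.ComputeUnitDescription()
    cud.executable = "/bin/echo"
    cud.arguments  = ["task %d" % i]
    cuds.append(cud)

umgr.submit_units(cuds)
umgr.wait_units()     # block until all tasks have completed

session.close()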

3. Advanced Analytics

The following pair plots show a scatter plot for each pair of the four features; clusters for the different species are indicated by color.

Link to Notebook
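
The description above matches the classic Iris dataset (four numeric features, three species), which ships with Seaborn; assuming that dataset, a pair plot of this kind can be reproduced in a few lines:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Iris: four numeric features plus a "species" label column.
iris = sns.load_dataset("iris")

# One scatter plot per pair of features; coloring the points by
# species makes the per-species clusters visible.
sns.pairplot(iris, hue="species")
plt.show()
```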

4. Future Work: Midas